Authors: John Nownes, Sonia Thomas, Yi Hang Khor, Kobe Pranivong

Project Coordinator: John Nownes

Raw Dataset

https://www.kaggle.com/sudalairajkumar/covid19-in-usa#us_states_covid19_daily.csv

Supplementary Data Sources:

Abstract

The dataset we chose for our project concerns the novel coronavirus that causes coronavirus disease 2019 (COVID-19). With nearly all parts of life in the US and in most places around the globe affected by the pandemic, it is already easily the most disruptive disease since the Spanish Flu of 1918. On top of the dangerously high number of deaths currently being predicted, the social distancing procedures being implemented to slow the spread of the coronavirus appear to be pushing the economy towards a serious recession, the effects of which may still be felt years from now. Until a vaccine is widely available, this pandemic appears likely to be the most important factor affecting the livelihood of nearly every person in this country for several months, and for this reason it is a worthwhile subject to study.

Summary Infographic

Figure 1: Featured infographic

The interactive graphic above is our featured infographic, which displays the data we explored. Please use the drop-down menu on the right to explore different mappings. Some conclusions drawn from this infographic can be found under the Conclusion section.

Description of Background and Questions Raised

The author of the dataset is Sudalai Rajkumar, a popular and highly-rated contributor on Kaggle. Below is his LinkedIn, which displays some of his most impressive accomplishments. From a credibility standpoint, we are reasonably confident that Mr. Rajkumar’s data on the coronavirus is among the most accurate and thorough datasets available to the public. In addition, we checked that some of the case and testing numbers in his dataset matched the numbers reported by the CDC, which further supported the accuracy of Mr. Rajkumar’s dataset.

https://www.linkedin.com/in/sudalairajkumar/?originalSubdomain=in

Within Mr. Rajkumar’s coronavirus Kaggle page, there are two different .csv files updated daily. One contains coronavirus statistics at the US national level, and the other contains coronavirus statistics at the US state level. We will be studying the latter, state-level file. We chose the state-level dataset because we wanted to look more closely at the individual states, especially since the severity of this pandemic varies widely by location. We also know that a range of different policies has been put in place at the state level, and we hope to show a summary of the coronavirus situation on a per-state level together with the social distancing efforts in that state.

Looking now specifically at our raw state-level .csv file, we see immediately that the dataset is arranged in a “tidy” format, since each observation includes the date of the observation and the state, making these the two keys for that row. As you move along the columns, there are statistics on tests performed cumulatively and on that day, as well as the number of each result (positive, negative, and pending). Additionally, we can see the number of deaths, hospitalizations, recoveries, and ventilators being used.

Finally, this state-level dataset is updated daily, meaning that every day Mr. Rajkumar adds a little over 50 rows (US states and US territories) that show the total coronavirus statistics in each state cumulatively up to that point. It is important to note that, in general, there is no accurate method to count the number of recovered patients, so the cumulative statistics offer the best way to gauge the severity of the coronavirus in each state, even though a significant number of the reported cases have recovered by that date.

Questions Raised

Some questions that we set out to explore in the analysis include the following:

  • What is the trend of the overall dataset? Is it increasing or decreasing? Speeding up or slowing down?
  • Where are the most reported cases? Least reported cases? (Specifically within each U.S. state)
  • How has Iowa been affected by the virus over time? Has there been a gradual increase, and what does it look like?
  • What is the increase in the amount of people tested in each state over time?
  • What can the percentage of positive testing results tell us about the current testing capacity of each state?
  • Which state is likely to have the most/least confirmed cases in two weeks' time, given the current growth rate?
  • Which state has done the best/worst job of preventing the spread of coronavirus?

Obtaining and Cleaning the Dataset

Once we downloaded our raw dataset from Kaggle, we noticed that it did not include population statistics for each state, which are needed in order to compare states with different populations. Because of this, we took supplementary population data from the 2019 US census numbers and joined it with our raw dataset. In addition, to produce the final two categorical mappings shown in our infographic, we added more supplementary data giving the status of each state’s stay at home order (as of April 30) as well as the dates on which these orders were enacted.

Cleaning all of this data together required careful joins in R and special attention to the data types in these columns. The full breakdown of how we cleaned this data is shown in approximately the first 65 lines of the Nownes.R file.
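
As a rough illustration of the kind of join and type cleanup described above, the following base R sketch joins a toy state-level table to a toy population table. It is not the code from Nownes.R. The data values are made up, and the column name `population` and the comma-formatted text are assumptions chosen to show why data types matter before a join.

```r
# Toy stand-ins for the state-level file and a census table.
# All values are made up for illustration only.
covid <- data.frame(
  date     = as.Date(c("2020-04-30", "2020-04-30", "2020-04-30")),
  state    = c("NY", "IA", "PR"),
  positive = c(300000, 7000, 1500),
  stringsAsFactors = FALSE
)
census <- data.frame(
  state      = c("NY", "IA"),
  population = c("19,500,000", "3,200,000"),  # hypothetical: read in as text
  stringsAsFactors = FALSE
)

# Fix the data type before joining: strip commas, then convert to numeric.
census$population <- as.numeric(gsub(",", "", census$population))

# Left join: keep every COVID row, even when no population is available.
joined <- merge(covid, census, by = "state", all.x = TRUE)

# Rows that failed to match show up with an NA population.
unmatched <- joined$state[is.na(joined$population)]
print(joined)
print(unmatched)
```

Checking for unmatched rows after the join is a cheap way to catch territories or differently spelled state codes that the population table does not cover.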

Exploratory Analysis

Figure 2: Cases/1000 People

We began our exploratory analysis by creating the above figure, showing the number of cases per 1000 people in each state. We see immediately that New York and its surrounding states have the most severe outbreaks of coronavirus at this time.
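
The per-1000 normalization behind this figure is simple arithmetic. Here is a minimal sketch with made-up numbers, where the column names `positive` and `population` are assumed for illustration:

```r
# Two hypothetical states with the same case count but different populations.
states <- data.frame(
  state      = c("A", "B"),
  positive   = c(5000, 5000),
  population = c(1e6, 1e7),
  stringsAsFactors = FALSE
)

# Cases per 1000 residents = cumulative positives / population * 1000.
states$cases_per_1000 <- states$positive / states$population * 1000

# Sort from most to least affected on a per-capita basis.
states <- states[order(-states$cases_per_1000), ]
print(states)
```

Both toy states report 5000 cases, but state A has ten times the per-capita rate, which is why the raw counts alone would be misleading on a map.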

Figure 3: Cases Per Day

Our exploratory analysis covers many different aspects of the data. The first task we tackled after cleaning the dataset was looking at the overall trend. Using ggplot, we produced a bar graph of the new reported cases by day in the United States over time (Figure 3). The U.S. has been reporting cases since January 22nd, and the graph shows an extremely low growth rate for the first two months. Around the middle of March there is a significant jump as cases increase. From April onward, the daily counts fluctuate somewhat. Overall, we concluded that cases increased rapidly and that, as of right now, there is no sign of a long-term decrease.
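
Because the raw file stores cumulative totals per state, a daily national series has to be derived by differencing. The following base R sketch shows one way to do that on made-up data. It is an assumption about the approach, not a copy of the project code.

```r
# Made-up cumulative positives for two hypothetical states over three days.
daily <- data.frame(
  date     = as.Date(rep(c("2020-04-01", "2020-04-02", "2020-04-03"), times = 2)),
  state    = rep(c("A", "B"), each = 3),
  positive = c(10, 15, 25, 0, 2, 3),
  stringsAsFactors = FALSE
)

# Differences are only meaningful when rows are ordered by date within state.
daily <- daily[order(daily$state, daily$date), ]

# New cases = today's cumulative total minus yesterday's, per state.
# The first day of each state has no previous value, so it is NA.
daily$new_cases <- ave(daily$positive, daily$state,
                       FUN = function(x) c(NA, diff(x)))

# Sum over states to get the national series (NA rows are dropped).
national <- aggregate(new_cases ~ date, data = daily, FUN = sum)
print(national)
```

The resulting `national` table has one row per date and could be passed to a bar chart of new cases per day.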

Figure 4: Tests Per Day

Next, we wanted to look at how the number of tests conducted has affected this overall trend. Again using ggplot, we produced the above bar graph of the number of tests conducted by day in the United States (Figure 4). Comparing the two graphs, we can conclude that the number of new reported cases per day increases with the number of tests taken per day, since both show a similar day-by-day increase. One caveat is that these graphs cannot fully reflect the real trend of the coronavirus in the US, because there are most likely individuals who have the virus but were not tested. This could also explain why there is almost no data for the first two months, and it suggests that the pandemic might have started earlier than March.

Figure 5: Testing Over Time

Figure 6: New York

Figure 7: Alaska

Figure 8: Wyoming

Next, we wanted to look at the increase in the number of people tested in each state over time. Using ggplot, we produced a graph that displays the testing trajectory for each state. Most notably, New York had the largest increase in the number of people tested, while Wyoming had the smallest. Other states with a high number of people tested were California, Florida, and Texas, which makes sense given the populations of these states.

We then wanted to see where in the US the most and the fewest cases were reported. Using plot_ly, we produced an interactive graph that displays the number of tests given, the number of positive cases, and the number of deaths for New York and Alaska as each day passed. These states were found using the min and max functions on the “positive” column for May 1st, 2020, which gave the minimum and maximum number of reported positive cases. The increasing trend we established earlier was still present in these graphs; however, we were also able to see a linear trend, as well as New York beginning to flatten its curve of positive cases, which is a good sign.

Finally, we wanted to know why Wyoming did not have the fewest reported cases despite having the smallest increase in people tested. To find out, we compared Alaska and Wyoming. Both states have relatively small populations (below 1 million); however, Alaska had 195 fewer positive cases than Wyoming despite having tested more than twice as many people. One major difference that could explain this is whether or not a stay at home order was issued. Alaska issued its order on March 25th, 2020, while Wyoming has yet to enact one. This suggests the impact that staying at home can have on the number of cases in a state.
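
The min and max lookup on the “positive” column described above can be sketched in base R as follows. The state labels and counts are placeholders, not values from the dataset, and `which.max` and `which.min` are used here as one convenient way to get the matching state rather than only the value.

```r
# Placeholder snapshot for a single date; counts are made up.
snapshot <- data.frame(
  date     = as.Date(c("2020-05-01", "2020-05-01", "2020-05-01")),
  state    = c("A", "B", "C"),
  positive = c(1200, 35, 410),
  stringsAsFactors = FALSE
)

# Restrict to the date of interest before comparing states.
on_day <- snapshot[snapshot$date == as.Date("2020-05-01"), ]

# which.max / which.min return the row position of the extreme value,
# which lets us look up the state it belongs to.
most_cases  <- on_day$state[which.max(on_day$positive)]
least_cases <- on_day$state[which.min(on_day$positive)]
print(c(most = most_cases, least = least_cases))
```

Filtering to one date first matters because the file is cumulative: without the filter, the maximum would simply pick the latest row of the hardest-hit state.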

Figure 9: Iowa Counties

Next, we thought it would be interesting to look specifically at the state of Iowa and how it has been affected by the virus. It is important to point out that, unlike in many states, the governor of Iowa never issued a stay at home order. Using plotly, we created a graph similar to the one above that shows the number of positive tests and deaths in Iowa over time. It shows a gradual increase in the number of positive cases in Iowa, which we concluded was possibly an effect of not issuing a stay at home order. Taking the analysis one step further, we looked at the counties of Iowa to see how the population of each county affected the number of cases there. Using ggplot, we created a bar graph that displays the number of cases in each county, colored by the population of that county. We did not include every county in Iowa because some of them had no data reported. Just as we expected, more highly populated counties such as Polk County and Black Hawk County have many more cases, as the graph shows. It is not surprising that Polk County has the most cases, since this is where Des Moines is located. Another example is Linn County, which has reported a large number of cases; this again is not surprising, since it contains Cedar Rapids, another highly populated city in Iowa.

Figure 10: Testing over time

Next, we thought it would be interesting to look at COVID-19 testing overall in the U.S. We compared the number of tests given to the number of tests that came back positive to see whether states were doing a good job of increasing their testing. Specifically, we were curious to see how testing in each state has increased over time and how it correlated with the number of positive results. We used ggplot to create a line graph for each state that displays two lines: one shows the number of tests conducted and the other shows the number of positive tests. Ideally, we would like to see testing increase and the number of positive tests decrease, so that a gap forms between the two lines in each graph. For example, you can see the gap we are referring to by comparing New Jersey and West Virginia. In New Jersey, the more people are tested, the more positive results are reported. This means that New Jersey needs to keep increasing its testing until the gap in its graph begins to widen. West Virginia, on the other hand, seems to be managing its testing well: as testing increases, positive results remain roughly constant and a gap starts to appear. Eventually we would like all states to look like West Virginia as far as testing goes.
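
The "gap" between the two lines can also be expressed numerically as the percentage of tests that are positive. The sketch below uses two hypothetical states with made-up cumulative counts; the column name `total_tests` is an assumption, not necessarily the name used in the raw file.

```r
# Made-up cumulative counts for two hypothetical states on two dates.
tests <- data.frame(
  state       = c("A", "A", "B", "B"),
  date        = as.Date(c("2020-04-01", "2020-04-15", "2020-04-01", "2020-04-15")),
  total_tests = c(1000, 2000, 1000, 2000),
  positive    = c(400, 800, 50, 60),
  stringsAsFactors = FALSE
)

# Share of tests that came back positive, in percent.
tests$pct_positive <- 100 * tests$positive / tests$total_tests

# Vertical distance between the "tests" line and the "positive" line.
tests$gap <- tests$total_tests - tests$positive
print(tests)
```

In this toy data, state A stays at 40% positive as testing doubles (positives grow in step with tests), while state B falls from 5% to 3% (the gap widens), which mirrors the contrast described between the two kinds of state graphs.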

Answers to Questions Raised

Conclusion

In the big picture, there is no indication of a long-term decline in new reported cases per day in the United States, as the number has been fluctuating. However, based on the graph, the number of new reported cases per day has remained fairly stable, which we attribute to the social restrictions implemented by the government. In addition, due to limited testing capacity, the percentage of positive tests in some severely affected states such as Connecticut, New Jersey, and Maryland is higher than average. As a result, the government has to increase testing capacity in order to bring the percentage of positive tests down to a safer benchmark.

Personal Contributions

John Nownes

My main contribution to this project was in cleaning and joining all pieces of data together, and creating the featured infographic (Figure 1). I also set up the video recordings, and helped translate this Final Report document from a Google Drive file into this Rmd file, with a special focus on formatting this report. A full summary of the work I did can be seen in the Nownes.R file.

Sonia Thomas

For this project I found the raw dataset on Kaggle, and we all agreed to do our research on COVID-19. I then wanted to focus on the virus specifically in Iowa and in the counties of Iowa. John suggested that it would be interesting to look at testing in the U.S., how it has increased over time, and what impact it had on the number of positive cases, so I focused on that as well. To tackle my first task, I obtained data from the sources listed above under Population data and COVID-19 in Iowa data. I then created my own csv file and merged the two together. To create my graphs, I used ggplot and plotly. We all contributed to the report and presentation, but specifically I presented 4 slides and wrote the exploratory analysis about the overall trend in the data, the virus in Iowa, and COVID-19 testing in the U.S.

Yi Hang Khor

I proposed most of the questions that could be answered from the dataset, to give the audience a better understanding of our project. In addition, I used ggplot to visualize the trend of the overall dataset and presented the related portion of the presentation. Lastly, I wrote the question-answering and conclusion parts of the report.

Kobe Pranivong

Although I was not the one who found this dataset, I was the one who suggested using Kaggle to find one, since it is a useful resource for the DS201 final project. For this project, I decided to explore the two questions “What is the increase in the amount of people tested in each state over time?” and “Where are the most reported cases? Least reported cases?”. I used the ggplot and plot_ly library functions to visualize the data for my questions. Then, I presented my portion of the presentation, which ended up being 3 slides. Lastly, I wrote the part of the exploratory analysis that covered my data.